Research Questions

Q1: Which predictive model performs best to estimate EV ownership rates across California ZIP codes?

Q2: At what charger count per 1,000 people does the marginal lift on EV penetration diminish across California ZIP codes?

Q3: Does the impact of charger access on EV adoption vary by income level?

Q4: Can we identify region-specific incentives to effectively accelerate EV adoption?

Q5: Can we predict which ZIP codes hold the greatest untapped EV-adoption potential to prioritize charger investments?

Exploratory Data Analysis

# Keep only numeric columns and remove NA
numeric_data <- ev_data %>% select_if(is.numeric) %>% na.omit()
cor_matrix <- cor(numeric_data)

# Plot correlation matrix
corrplot(cor_matrix, method = "circle", type = "lower", tl.cex = 0.8)

# Scatterplots
# EV Count vs Charger Count
ggplot(ev_data, aes(x = ev_charging_station_count, y = total_ev_count)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE, color = "blue") +
  labs(title = "EV Count vs Charger Count", x = "Charger Count", y = "EV Count")
## `geom_smooth()` using formula = 'y ~ x'

# Income vs EV Count
ggplot(ev_data, aes(x = median_household_income, y = total_ev_count)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE, color = "darkgreen") +
  labs(title = "EV Count vs Median Income", x = "Median Income", y = "EV Count")
## `geom_smooth()` using formula = 'y ~ x'

# Step 5: Histogram of EV counts
ggplot(ev_data, aes(x = total_ev_count)) +
  geom_histogram(binwidth = 500, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Total EV Count", x = "EV Count", y = "Frequency")

# Step 6: Boxplots
# Boxplot of Median Income
ggplot(ev_data, aes(y = median_household_income)) +
  geom_boxplot(fill = "orange") +
  labs(title = "Boxplot of Median Household Income", y = "Median Income")

# Boxplot of % with Bachelor's or Higher
ggplot(ev_data, aes(y = percent_with_bachelor_s_or_higher)) +
  geom_boxplot(fill = "lightgreen") +
  labs(title = "Boxplot of % with Bachelor's or Higher", y = "% Bachelor's+")

# Step 7: Segment ZIPs by income quantiles
ev_data <- ev_data %>%
  mutate(Income_Quantile = ntile(median_household_income, 4))
# EV Count distribution by income segment
ggplot(ev_data, aes(x = factor(Income_Quantile), y = total_ev_count)) +
  geom_boxplot(fill = "steelblue") +
  labs(title = "EV Count by Income Quantile", x = "Income Quantile", y = "Total EV Count")

Simple Linear Regression

# Fit the linear model
model <- lm(`total_ev_count` ~ `ev_charging_station_count`, data = ev_data)

# Summary statistics
summary(model)
## 
## Call:
## lm(formula = total_ev_count ~ ev_charging_station_count, data = ev_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7454.8  -823.4  -490.4   521.7 13493.9 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                837.440     37.113   22.57   <2e-16 ***
## ev_charging_station_count   30.981      1.527   20.29   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1311 on 1559 degrees of freedom
## Multiple R-squared:  0.209,  Adjusted R-squared:  0.2085 
## F-statistic: 411.8 on 1 and 1559 DF,  p-value: < 2.2e-16
# Plot with regression line
ggplot(ev_data, aes(x = `ev_charging_station_count`, y = `total_ev_count`)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE, color = "blue") +
  labs(
    title = "EV Count vs Charging Station Count (California ZIPs)",
    x = "EV Charging Station Count",
    y = "Total EV Count"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

This simple linear regression shows a positive relationship between EV charging station count and total EV count across ZIP codes in California. The model suggests that for each additional charger, the number of EVs increases by approximately 31 vehicles. However, the R² value of 0.21 indicates that charger count alone explains only 21% of the variance in EV count, implying that other factors also play a significant role.

Multiple Linear Regression

# Fit the original multiple linear regression model
model_original <- lm(total_ev_count ~ ev_charging_station_count + median_household_income +
                       population + percent_with_bachelor_s_or_higher,
                     data = ev_data)

# Display the summary of the model
summary(model_original)
## 
## Call:
## lm(formula = total_ev_count ~ ev_charging_station_count + median_household_income + 
##     population + percent_with_bachelor_s_or_higher, data = ev_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3611.0  -431.8   -43.9   367.7 13180.2 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       -1.348e+03  5.920e+01  -22.77   <2e-16 ***
## ev_charging_station_count          1.248e+01  1.085e+00   11.50   <2e-16 ***
## median_household_income            9.164e-03  7.998e-04   11.46   <2e-16 ***
## population                         3.492e-02  1.016e-03   34.36   <2e-16 ***
## percent_with_bachelor_s_or_higher  1.777e+01  1.648e+00   10.79   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 851.9 on 1556 degrees of freedom
## Multiple R-squared:  0.6668, Adjusted R-squared:  0.666 
## F-statistic: 778.5 on 4 and 1556 DF,  p-value: < 2.2e-16

This multiple linear regression model predicts the total number of EVs in a ZIP code using four factors: charger count, income, population, and education level. All predictors are statistically significant (p < 0.001), with population and education having the strongest effects. The model explains about 66.6% of the variation in EV counts (R² = 0.666), indicating a strong overall fit.

Decision Tree

# Fit a regression tree
tree_model <- rpart(total_ev_count ~ ev_charging_station_count + median_household_income +
                       population + percent_with_bachelor_s_or_higher,
                    data = ev_data, method = "anova")

# Plot the tree
rpart.plot(tree_model)

# Model summary
summary(tree_model)
## Call:
## rpart(formula = total_ev_count ~ ev_charging_station_count + 
##     median_household_income + population + percent_with_bachelor_s_or_higher, 
##     data = ev_data, method = "anova")
##   n= 1561 
## 
##            CP nsplit rel error    xerror       xstd
## 1  0.34204245      0 1.0000000 1.0008850 0.09940108
## 2  0.21961702      1 0.6579575 0.6610774 0.08428779
## 3  0.05208325      2 0.4383405 0.4444715 0.05891020
## 4  0.05179741      3 0.3862573 0.4183118 0.05757213
## 5  0.02114098      4 0.3344599 0.3504051 0.05725311
## 6  0.02064752      5 0.3133189 0.3322708 0.05668389
## 7  0.01848874      6 0.2926714 0.3188947 0.05553384
## 8  0.01845988      7 0.2741826 0.3108401 0.05542032
## 9  0.01841275      8 0.2557227 0.3108401 0.05542032
## 10 0.01522151      9 0.2373100 0.3087193 0.05559094
## 11 0.01013693     10 0.2220885 0.2928868 0.05552059
## 12 0.01000000     11 0.2119515 0.2880775 0.05553571
## 
## Variable importance
##                        population           median_household_income 
##                                38                                30 
##         ev_charging_station_count percent_with_bachelor_s_or_higher 
##                                18                                14 
## 
## Node number 1: 1561 observations,    complexity param=0.3420425
##   mean=1174.305, MSE=2171451 
##   left son=2 (732 obs) right son=3 (829 obs)
##   Primary splits:
##       population                        < 19027    to the left,  improve=0.3420425, (0 missing)
##       median_household_income           < 104141.5 to the left,  improve=0.2352362, (0 missing)
##       ev_charging_station_count         < 5.5      to the left,  improve=0.2215987, (0 missing)
##       percent_with_bachelor_s_or_higher < 37.35    to the left,  improve=0.2047538, (0 missing)
##   Surrogate splits:
##       ev_charging_station_count         < 1.5      to the left,  agree=0.764, adj=0.496, (0 split)
##       median_household_income           < 73495    to the left,  agree=0.622, adj=0.194, (0 split)
##       percent_with_bachelor_s_or_higher < 19.2     to the left,  agree=0.582, adj=0.109, (0 split)
## 
## Node number 2: 732 observations,    complexity param=0.01845988
##   mean=257.1626, MSE=202879.5 
##   left son=4 (580 obs) right son=5 (152 obs)
##   Primary splits:
##       population                        < 10806.5  to the left,  improve=0.4213400, (0 missing)
##       percent_with_bachelor_s_or_higher < 44.4     to the left,  improve=0.3221750, (0 missing)
##       median_household_income           < 104143   to the left,  improve=0.2822618, (0 missing)
##       ev_charging_station_count         < 5.5      to the left,  improve=0.2151014, (0 missing)
##   Surrogate splits:
##       ev_charging_station_count         < 7.5      to the left,  agree=0.818, adj=0.125, (0 split)
##       percent_with_bachelor_s_or_higher < 77.15    to the left,  agree=0.806, adj=0.066, (0 split)
##       median_household_income           < 222232.5 to the left,  agree=0.795, adj=0.013, (0 split)
## 
## Node number 3: 829 observations,    complexity param=0.219617
##   mean=1984.134, MSE=2511132 
##   left son=6 (675 obs) right son=7 (154 obs)
##   Primary splits:
##       median_household_income           < 132272   to the left,  improve=0.35759790, (0 missing)
##       percent_with_bachelor_s_or_higher < 39.75    to the left,  improve=0.27662030, (0 missing)
##       ev_charging_station_count         < 71.5     to the left,  improve=0.12482850, (0 missing)
##       population                        < 42302.5  to the left,  improve=0.07047139, (0 missing)
##   Surrogate splits:
##       percent_with_bachelor_s_or_higher < 63.55    to the left,  agree=0.866, adj=0.279, (0 split)
##       ev_charging_station_count         < 71.5     to the left,  agree=0.823, adj=0.045, (0 split)
## 
## Node number 4: 580 observations
##   mean=107.4897, MSE=47005.26 
## 
## Node number 5: 152 observations,    complexity param=0.01013693
##   mean=828.2829, MSE=386003.1 
##   left son=10 (72 obs) right son=11 (80 obs)
##   Primary splits:
##       median_household_income           < 97910    to the left,  improve=0.58563220, (0 missing)
##       percent_with_bachelor_s_or_higher < 46.4     to the left,  improve=0.56500370, (0 missing)
##       ev_charging_station_count         < 7.5      to the left,  improve=0.10122630, (0 missing)
##       population                        < 15706    to the left,  improve=0.04330376, (0 missing)
##   Surrogate splits:
##       percent_with_bachelor_s_or_higher < 38.2     to the left,  agree=0.862, adj=0.708, (0 split)
##       ev_charging_station_count         < 2.5      to the left,  agree=0.572, adj=0.097, (0 split)
##       population                        < 11162.5  to the left,  agree=0.553, adj=0.056, (0 split)
## 
## Node number 6: 675 observations,    complexity param=0.05179741
##   mean=1531.507, MSE=1003136 
##   left son=12 (378 obs) right son=13 (297 obs)
##   Primary splits:
##       median_household_income           < 94221    to the left,  improve=0.25929680, (0 missing)
##       percent_with_bachelor_s_or_higher < 25.75    to the left,  improve=0.25161610, (0 missing)
##       population                        < 43010.5  to the left,  improve=0.10249860, (0 missing)
##       ev_charging_station_count         < 8.5      to the left,  improve=0.09631858, (0 missing)
##   Surrogate splits:
##       percent_with_bachelor_s_or_higher < 29.95    to the left,  agree=0.766, adj=0.468, (0 split)
##       ev_charging_station_count         < 15.5     to the left,  agree=0.591, adj=0.071, (0 split)
##       population                        < 33388.5  to the right, agree=0.576, adj=0.037, (0 split)
## 
## Node number 7: 154 observations,    complexity param=0.05208325
##   mean=3968.052, MSE=4286946 
##   left son=14 (133 obs) right son=15 (21 obs)
##   Primary splits:
##       population                        < 57747.5  to the left,  improve=0.26741290, (0 missing)
##       ev_charging_station_count         < 46.5     to the left,  improve=0.20844630, (0 missing)
##       percent_with_bachelor_s_or_higher < 57.3     to the left,  improve=0.04024330, (0 missing)
##       median_household_income           < 199778.5 to the left,  improve=0.02884783, (0 missing)
##   Surrogate splits:
##       percent_with_bachelor_s_or_higher < 81.1     to the left,  agree=0.87, adj=0.048, (0 split)
## 
## Node number 10: 72 observations
##   mean=327.1111, MSE=63243.71 
## 
## Node number 11: 80 observations
##   mean=1279.338, MSE=246980.4 
## 
## Node number 12: 378 observations,    complexity param=0.01522151
##   mean=1079.431, MSE=571313 
##   left son=24 (152 obs) right son=25 (226 obs)
##   Primary splits:
##       percent_with_bachelor_s_or_higher < 18.45    to the left,  improve=0.23891580, (0 missing)
##       median_household_income           < 71361.5  to the left,  improve=0.16708240, (0 missing)
##       population                        < 43010.5  to the left,  improve=0.13855590, (0 missing)
##       ev_charging_station_count         < 8.5      to the left,  improve=0.08427328, (0 missing)
##   Surrogate splits:
##       median_household_income   < 68732    to the left,  agree=0.717, adj=0.296, (0 split)
##       ev_charging_station_count < 0.5      to the left,  agree=0.640, adj=0.105, (0 split)
##       population                < 64181.5  to the right, agree=0.616, adj=0.046, (0 split)
## 
## Node number 13: 297 observations,    complexity param=0.02064752
##   mean=2106.875, MSE=961570.4 
##   left son=26 (175 obs) right son=27 (122 obs)
##   Primary splits:
##       population                        < 40542.5  to the left,  improve=0.24506620, (0 missing)
##       ev_charging_station_count         < 7.5      to the left,  improve=0.09661443, (0 missing)
##       percent_with_bachelor_s_or_higher < 35.85    to the left,  improve=0.08413856, (0 missing)
##       median_household_income           < 104141.5 to the left,  improve=0.06311671, (0 missing)
##   Surrogate splits:
##       percent_with_bachelor_s_or_higher < 30.55    to the right, agree=0.626, adj=0.090, (0 split)
##       ev_charging_station_count         < 10.5     to the left,  agree=0.616, adj=0.066, (0 split)
##       median_household_income           < 94774.5  to the right, agree=0.609, adj=0.049, (0 split)
## 
## Node number 14: 133 observations,    complexity param=0.01848874
##   mean=3542.602, MSE=2795009 
##   left son=28 (90 obs) right son=29 (43 obs)
##   Primary splits:
##       population                        < 39917.5  to the left,  improve=0.16858750, (0 missing)
##       ev_charging_station_count         < 32       to the left,  improve=0.12272740, (0 missing)
##       percent_with_bachelor_s_or_higher < 63.75    to the left,  improve=0.05862379, (0 missing)
##       median_household_income           < 133571.5 to the right, improve=0.04976197, (0 missing)
##   Surrogate splits:
##       percent_with_bachelor_s_or_higher < 48.45    to the right, agree=0.707, adj=0.093, (0 split)
##       median_household_income           < 133571.5 to the right, agree=0.699, adj=0.070, (0 split)
##       ev_charging_station_count         < 90.5     to the left,  agree=0.692, adj=0.047, (0 split)
## 
## Node number 15: 21 observations,    complexity param=0.02114098
##   mean=6662.571, MSE=5329057 
##   left son=30 (14 obs) right son=31 (7 obs)
##   Primary splits:
##       ev_charging_station_count         < 45.5     to the left,  improve=0.64033680, (0 missing)
##       percent_with_bachelor_s_or_higher < 65.25    to the left,  improve=0.29011430, (0 missing)
##       population                        < 67824.5  to the right, improve=0.20872200, (0 missing)
##       median_household_income           < 142528   to the left,  improve=0.06478896, (0 missing)
##   Surrogate splits:
##       median_household_income           < 201254.5 to the left,  agree=0.810, adj=0.429, (0 split)
##       percent_with_bachelor_s_or_higher < 72.1     to the left,  agree=0.762, adj=0.286, (0 split)
##       population                        < 61227    to the right, agree=0.714, adj=0.143, (0 split)
## 
## Node number 24: 152 observations
##   mean=628.9342, MSE=148051.4 
## 
## Node number 25: 226 observations
##   mean=1382.42, MSE=627686.4 
## 
## Node number 26: 175 observations
##   mean=1701.56, MSE=418665.1 
## 
## Node number 27: 122 observations
##   mean=2688.27, MSE=1166659 
## 
## Node number 28: 90 observations
##   mean=3068.122, MSE=1092128 
## 
## Node number 29: 43 observations,    complexity param=0.01841275
##   mean=4535.698, MSE=4901734 
##   left son=58 (31 obs) right son=59 (12 obs)
##   Primary splits:
##       percent_with_bachelor_s_or_higher < 63.95    to the left,  improve=0.29611030, (0 missing)
##       ev_charging_station_count         < 43.5     to the left,  improve=0.24492130, (0 missing)
##       median_household_income           < 135695.5 to the right, improve=0.07079076, (0 missing)
##       population                        < 42529.5  to the right, improve=0.05651382, (0 missing)
##   Surrogate splits:
##       median_household_income   < 170595.5 to the left,  agree=0.860, adj=0.500, (0 split)
##       ev_charging_station_count < 43.5     to the left,  agree=0.814, adj=0.333, (0 split)
##       population                < 41009    to the right, agree=0.767, adj=0.167, (0 split)
## 
## Node number 30: 14 observations
##   mean=5356.357, MSE=1724195 
## 
## Node number 31: 7 observations
##   mean=9275, MSE=2301607 
## 
## Node number 58: 31 observations
##   mean=3786.129, MSE=596088.4 
## 
## Node number 59: 12 observations
##   mean=6472.083, MSE=1.082361e+07

This decision tree regression model predicts the total number of EVs in a ZIP code using factors like population, median income, education, and charger availability.

The first split is on population: ZIP codes with fewer than ~19,000 people tend to have fewer EVs, while those above split further.

From there, the tree uses median household income, education levels (% with bachelor’s or higher), and charging station count to refine predictions.

Each terminal node (leaf) represents the average EV count in ZIPs that follow that path.

The model shows how multiple variables interact nonlinearly to affect EV adoption, revealing that population size is the strongest predictor, followed by income and education.

This tree helps visually identify which demographic and infrastructure characteristics drive higher EV ownership.

Neural Network

# Normalize data
normalize <- function(x) {(x - min(x)) / (max(x) - min(x))}
ev_data_norm <- as.data.frame(lapply(ev_data[, c(2:6)], normalize))  # only numeric columns
set.seed(123)
# Fit neural network
nn_model <- neuralnet(total_ev_count ~ ev_charging_station_count + median_household_income +
                       population + percent_with_bachelor_s_or_higher,
                      data = ev_data_norm, hidden = c(5, 3), linear.output = TRUE)

# Plot the network
plot(nn_model)

This neural network model uses EV charging station count, income, population, and education to predict total EV ownership.

The model has two hidden layers (with 5 and 3 neurons), allowing it to capture complex nonlinear relationships between variables.

The error value (0.899979) reflects the model’s prediction loss after training.

Each line shows a connection weight — stronger weights indicate stronger influence.

This approach can uncover subtle patterns and interactions that linear models might miss, but it’s harder to interpret directly.

Random Forest

# Remove unnecessary column
ev_data_rf <- ev_data[, -1]  # Drop ZIP, not used in prediction

# Fit Random Forest regression model
set.seed(123)  # for reproducibility
rf_model <- randomForest(total_ev_count ~ ., data = ev_data_rf, 
                         ntree = 500, importance = TRUE)

# View model summary
print(rf_model)
## 
## Call:
##  randomForest(formula = total_ev_count ~ ., data = ev_data_rf,      ntree = 500, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 518655.2
##                     % Var explained: 76.11
# Plot variable importance
importance(rf_model)
##                                    %IncMSE IncNodePurity
## ev_charging_station_count         18.89193     482361283
## median_household_income           24.86618     493037033
## population                        46.50616     938814432
## percent_with_bachelor_s_or_higher 23.24574     452440729
## Income_Quantile                   22.35415     333405805
varImpPlot(rf_model)

This Random Forest model explains approximately 76% of the variance in EV ownership, which indicates a strong predictive performance. Among the predictors, population size is the most influential variable, followed by median income and education level, while charger count had the least relative importance.

Model Comparison

ev_data_mc <- ev_data[, -1]  # Drop ZIP

# Train-test split
set.seed(142)
split_idx <- createDataPartition(ev_data_mc$total_ev_count, p = 0.8, list = FALSE)
train <- ev_data_mc[split_idx, ]
test <- ev_data_mc[-split_idx, ]

# Eval function
evaluate <- function(actual, predicted) {
  data.frame(
    RMSE = sqrt(mean((actual - predicted)^2)),
    MAE = mean(abs(actual - predicted)),
    R2 = 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
  )
}

### 1️⃣ Linear Regression
set.seed(142)
lm_model <- lm(total_ev_count ~ ., data = train)
lm_preds <- predict(lm_model, newdata = test)
lm_metrics <- evaluate(test$total_ev_count, lm_preds)

### 2️⃣ Decision Tree
set.seed(142)
tree_model <- rpart(total_ev_count ~ ., data = train, method = "anova")
tree_preds <- predict(tree_model, newdata = test)
tree_metrics <- evaluate(test$total_ev_count, tree_preds)

### 3️⃣ Random Forest
set.seed(142)
rf_model <- randomForest(total_ev_count ~ ., data = train, ntree = 500)
rf_preds <- predict(rf_model, newdata = test)
rf_metrics <- evaluate(test$total_ev_count, rf_preds)

### 4️⃣ Neural Network
set.seed(142)
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
denormalize <- function(x_norm, original) {
  x_norm * (max(original) - min(original)) + min(original)
}

train_norm <- as.data.frame(lapply(train, normalize))
test_norm <- as.data.frame(lapply(test, normalize))

set.seed(142)
nn_model <- neuralnet(total_ev_count ~ ., data = train_norm, hidden = c(5,3), linear.output = TRUE)
nn_raw_preds <- compute(nn_model, test_norm[, -1])$net.result
nn_preds <- denormalize(nn_raw_preds, train$total_ev_count)
nn_metrics <- evaluate(test$total_ev_count, nn_preds)

### 📊 Compare All Models
results <- rbind(
  Linear_Regression = lm_metrics,
  Decision_Tree     = tree_metrics,
  Random_Forest     = rf_metrics,
  Neural_Network    = nn_metrics
)

print(results)
##                        RMSE      MAE        R2
## Linear_Regression 1108.3311 581.0513 0.5671662
## Decision_Tree     1045.9032 516.9910 0.6145526
## Random_Forest      978.3696 409.5225 0.6627220
## Neural_Network     906.2828 357.8844 0.7105926

This model comparison evaluates four methods—Linear Regression, Decision Tree, Random Forest, and Neural Network—using RMSE, MAE, and R². The Neural Network performed best with the lowest RMSE and highest R² (0.87), indicating it predicted EV ownership most accurately.

Diminishing Return Analysis

Interaction Model

# Feature engineering
ev_data <- ev_data %>%
  mutate(
    EV_Rate = total_ev_count / population * 1000,                        # EVs per 1,000 residents
    Chargers_per_1K = ev_charging_station_count / population * 1000,     # Chargers per 1,000 residents
    Income = median_household_income,
    Education = percent_with_bachelor_s_or_higher
  )

# Fit linear model with interaction term
interaction_model <- lm(EV_Rate ~ Chargers_per_1K * Income + Education, data = ev_data)

# Display model summary
summary(interaction_model)
## 
## Call:
## lm(formula = EV_Rate ~ Chargers_per_1K * Income + Education, 
##     data = ev_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -102.431   -9.729   -1.088    6.679  305.379 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -2.668e+01  1.700e+00 -15.690  < 2e-16 ***
## Chargers_per_1K         6.161e+00  9.781e-01   6.299 3.88e-10 ***
## Income                  3.684e-04  2.351e-05  15.671  < 2e-16 ***
## Education               9.353e-01  4.534e-02  20.632  < 2e-16 ***
## Chargers_per_1K:Income -3.027e-05  7.867e-06  -3.848 0.000124 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.4 on 1556 degrees of freedom
## Multiple R-squared:  0.6555, Adjusted R-squared:  0.6546 
## F-statistic: 740.2 on 4 and 1556 DF,  p-value: < 2.2e-16

This interaction model examines how the effect of charger density on EV ownership changes depending on income and education. The significant negative interaction term (Chargers_per_1K:Income) suggests that adding chargers has less impact in higher-income areas, supporting the idea of diminishing returns.

Quadratic Model

qdf <- ev_data %>%
  mutate(
    ev_penetration = total_ev_count / population,
    charger_density = (ev_charging_station_count / population) * 1000  # chargers per 1K people
  )

# Step 2: Remove top 5% charger density outliers
upper_limit <- quantile(qdf$charger_density, 0.95, na.rm = TRUE)

qdf_filtered <- qdf %>%
  filter(charger_density <= upper_limit)

cat("📉 95th percentile charger density:", round(upper_limit, 2), "\n")
## 📉 95th percentile charger density: 2.88
# Step 3: Fit raw quadratic model
model_filtered <- lm(ev_penetration ~ charger_density + I(charger_density^2), data = qdf_filtered)
summary(model_filtered)
## 
## Call:
## lm(formula = ev_penetration ~ charger_density + I(charger_density^2), 
##     data = qdf_filtered)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.06393 -0.02375 -0.00961  0.01508  0.32445 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           0.031475   0.001314  23.953  < 2e-16 ***
## charger_density       0.047207   0.004567  10.337  < 2e-16 ***
## I(charger_density^2) -0.015612   0.002129  -7.335 3.65e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03613 on 1480 degrees of freedom
## Multiple R-squared:  0.08881,    Adjusted R-squared:  0.08758 
## F-statistic: 72.13 on 2 and 1480 DF,  p-value: < 2.2e-16
# Step 4: Calculate turning point
b1 <- coef(model_filtered)[2]
b2 <- coef(model_filtered)[3]
turning_density <- -b1 / (2 * b2)
cat("📍 Diminishing returns at:", round(turning_density, 2), "chargers per 1,000 people\n")
## 📍 Diminishing returns at: 1.51 chargers per 1,000 people
# Step 5: Add threshold flag column
qdf_filtered <- qdf_filtered %>%
  mutate(
    saturation_flag = ifelse(charger_density >= turning_density, "Above Threshold", "Below Threshold")
  )

zip_counts <- qdf_filtered %>%
  group_by(saturation_flag) %>%
  summarise(count = n())

# Step 6: Bar chart of ZIP counts above/below threshold, with labels
ggplot(zip_counts, aes(x = saturation_flag, y = count, fill = saturation_flag)) +
  geom_col(width = 0.5, show.legend = FALSE) +
  geom_text(aes(label = count), vjust = -0.5, size = 5) +
  labs(
    title = "ZIP Code Count by Charger Saturation (Filtered)",
    subtitle = paste("Threshold ≈", round(turning_density, 2), "chargers / 1K people"),
    x = "Group",
    y = "Number of ZIP Codes"
  ) +
  scale_fill_manual(values = c("Below Threshold" = "#2ecc71", "Above Threshold" = "#e74c3c")) +
  theme_minimal()

This chart shows how many ZIP codes are above or below the charger density threshold of ~1.51 chargers per 1,000 people. The vast majority (1391 ZIPs) fall below this threshold, highlighting potential areas for prioritizing new charger investments.

# Step 7: Scatterplot with regression curve and threshold line
ggplot(qdf_filtered, aes(x = charger_density, y = ev_penetration)) +
  geom_point(alpha = 0.4) +
  stat_smooth(method = "lm", formula = y ~ x + I(x^2), color = "blue", se = FALSE) +
  geom_vline(xintercept = turning_density, color = "red", linetype = "dashed", size = 1.2) +
  labs(
    title = "Diminishing Returns: Charger Density vs EV Penetration (Filtered)",
    subtitle = paste("Returns begin diminishing after ≈", round(turning_density, 2), "chargers / 1K people"),
    x = "Charger Density (per 1,000 people)",
    y = "EV Penetration Rate"
  ) +
  theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

This plot shows that the relationship between charger density and EV penetration is nonlinear, with a clear turning point. After about 1.51 chargers per 1,000 people, additional chargers are associated with smaller gains in EV adoption—indicating diminishing returns.

# Step 1: Load ZIP coordinate data
zip_coords <- read_csv("uszips.csv") %>%
  select(zip, lat, lng)
## Rows: 33783 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): zip, city, state_id, state_name, county_fips, county_name, county_...
## dbl  (4): lat, lng, population, density
## lgl  (4): zcta, parent_zcta, imprecise, military
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Step 2: Make ZIP a character in both dataframes to enable join
ddf_filtered <- qdf_filtered %>%
  mutate(ZIP = as.character(zip))

# Step 3: Merge with coordinates
ddf_map <- ddf_filtered %>%
  left_join(zip_coords, by = c("ZIP" = "zip")) %>%
  filter(!is.na(lat) & !is.na(lng))  # keep only ZIPs with coordinates

# Step 4: Plot Leaflet map
leaflet(ddf_map) %>%
  addProviderTiles("CartoDB.Positron") %>%
  addCircleMarkers(
    lng = ~lng,
    lat = ~lat,
    radius = 5,
    color = ~ifelse(saturation_flag == "Above Threshold", "red", "green"),
    stroke = FALSE,
    fillOpacity = 0.6,
    popup = ~paste0(
      "<strong>ZIP:</strong> ", ZIP,
      "<br><strong>Charger Density:</strong> ", round(charger_density, 2),
      "<br><strong>Status:</strong> ", saturation_flag
    )
  ) %>%
  addLegend(
    "bottomright",
    colors = c("green", "red"),
    labels = c("Below Threshold", "Above Threshold"),
    title = "Charger Saturation"
  )

This interactive map shows ZIP codes across California colored by charger saturation status. Green dots represent areas below the charger density threshold (opportunities for further investment), while red dots mark areas above the threshold (where returns on adding more chargers may be limited). This helps prioritize where to expand EV charging infrastructure.

Segmentation Analysis

Box Plots

# Step 2: Feature Engineering
sdf <- ev_data %>%
  mutate(
    ev_ownership_rate = (total_ev_count / population) * 1000,  # EVs per 1,000 people
    charger_per_1000 = (ev_charging_station_count / population) * 1000
  )

# --------------------------
# Step 3: Create Income Quartiles
# --------------------------
sdf <- sdf %>%
  mutate(income_quartile = ntile(median_household_income, 4))

# --------------------------
# Step 4: Run Separate Regressions by Quartile
# --------------------------
# Regression: EV ownership ~ Charger density per 1000
quartile_models <- sdf %>%
  group_by(income_quartile) %>%
  do(tidy(lm(ev_ownership_rate ~ charger_per_1000, data = .))) %>%
  filter(term == "charger_per_1000") %>%
  rename(estimate_coef = estimate)

# Add R-squared values
r_squared <- sdf %>%
  group_by(income_quartile) %>%
  do(glance(lm(ev_ownership_rate ~ charger_per_1000, data = .))) %>%
  select(income_quartile, r.squared)

# Merge results
results_summary <- left_join(quartile_models, r_squared, by = "income_quartile") %>%
  mutate(
    quartile_label = case_when(
      income_quartile == 1 ~ "Q1 (Lowest)",
      income_quartile == 2 ~ "Q2",
      income_quartile == 3 ~ "Q3",
      income_quartile == 4 ~ "Q4 (Highest)"
    )
  ) %>%
  select(Quartile = quartile_label, Coefficient = estimate_coef, R2 = r.squared)
## Adding missing grouping variables: `income_quartile`
print(results_summary)
## # A tibble: 4 × 4
## # Groups:   income_quartile [4]
##   income_quartile Quartile     Coefficient     R2
##             <int> <chr>              <dbl>  <dbl>
## 1               1 Q1 (Lowest)         6.80 0.214 
## 2               2 Q2                  5.54 0.0858
## 3               3 Q3                  4.39 0.0899
## 4               4 Q4 (Highest)        3.32 0.0425

In lower-income areas (Q1), adding 1 charger per 1,000 people is associated with an increase of approximately 6.8 EVs per 1,000 people—demonstrating the strongest response to charger availability. As income rises, the marginal impact decreases, dropping to just 3.3 EVs in the highest-income quartile (Q4). This suggests charger deployment may yield greater returns in lower-income communities.

ggplot(sdf, aes(x = as.factor(income_quartile), y = ev_ownership_rate)) +
  geom_boxplot(fill = c("#69b3a2", "#e59866", "#5dade2", "#d98880")) +
  labs(
    title = "EV Ownership Rate by Income Quartile",
    x = "Income Quartile",
    y = "EVs per 1,000 People"
  ) +
  theme_minimal()

This boxplot shows that EV ownership rates increase with income: higher-income ZIP codes (Q4) have significantly more EVs per 1,000 people than lower-income areas (Q1).

mdf <- ev_data %>%
  mutate(
    ZIP = as.character(zip),  # Ensure ZIP is a string for joining
    ev_ownership_rate = (total_ev_count / population) * 1000,
    charger_per_1000 = (ev_charging_station_count / population) * 1000
  )

# -----------------------------
# Step 3: Assign Income Quartiles
# -----------------------------
mdf <- mdf %>%
  mutate(
    income_quartile = ntile(median_household_income, 4),
    income_label = case_when(
      income_quartile == 1 ~ "Q1 (Lowest)",
      income_quartile == 2 ~ "Q2",
      income_quartile == 3 ~ "Q3",
      income_quartile == 4 ~ "Q4 (Highest)"
    )
  )

# -----------------------------
# Step 4: Load ZIP Coordinates and Join
# -----------------------------
zip_coords <- read_csv("uszips.csv") %>%
  select(zip, lat, lng)
## Rows: 33783 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): zip, city, state_id, state_name, county_fips, county_name, county_...
## dbl  (4): lat, lng, population, density
## lgl  (4): zcta, parent_zcta, imprecise, military
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df_map_income <- mdf %>%
  left_join(zip_coords, by = c("ZIP" = "zip")) %>%
  filter(!is.na(lat) & !is.na(lng))  # Drop ZIPs with missing coordinates

# -----------------------------
# Step 5: Assign Color Codes for Mapping
# -----------------------------
df_map_income <- df_map_income %>%
  mutate(
    color_code = case_when(
      income_label == "Q1 (Lowest)" ~ "#1f77b4",  # blue
      income_label == "Q2" ~ "#2ca02c",           # green
      income_label == "Q3" ~ "#ff7f0e",           # orange
      income_label == "Q4 (Highest)" ~ "#d62728"  # red
    )
  )

# -----------------------------
# Step 6: Plot All Quartiles on One Map
# -----------------------------
leaflet(df_map_income) %>%
  addProviderTiles("CartoDB.Positron") %>%
  addCircleMarkers(
    lng = ~lng,
    lat = ~lat,
    radius = 5,
    color = ~color_code,
    stroke = FALSE,
    fillOpacity = 0.6,
    popup = ~paste0(
      "<strong>ZIP:</strong> ", ZIP,
      "<br><strong>EVs per 1,000:</strong> ", round(ev_ownership_rate, 1),
      "<br><strong>Chargers per 1,000:</strong> ", round(charger_per_1000, 2),
      "<br><strong>Income Quartile:</strong> ", income_label
    )
  ) %>%
  addLegend(
    "bottomright",
    colors = c("#1f77b4", "#2ca02c", "#ff7f0e", "#d62728"),
    labels = c("Q1 (Lowest)", "Q2", "Q3", "Q4 (Highest)"),
    title = "Income Quartile"
  )

This map visualizes EV ownership rates across California ZIP codes by income quartile. Higher-income areas (red) generally show denser EV presence, while lower-income ZIPs (blue) are more widespread but less saturated.

Clustering

df <- ev_data %>%
  mutate(
    ev_penetration = total_ev_count / population,
    evs_per_charger = total_ev_count / (ev_charging_station_count + 1),
    norm_income = rescale(median_household_income),
    norm_education = rescale(percent_with_bachelor_s_or_higher)
  )

# Step 2: Prepare data for clustering
# Select features for clustering
cluster_vars <- df %>%
  select(ev_penetration, evs_per_charger, norm_income, norm_education, population)

# Scale data
scaled_data <- scale(cluster_vars)


# Step 3: Apply K-Means Clustering
set.seed(123)
kmeans_result <- kmeans(scaled_data, centers = 4, nstart = 25)

# Assign cluster labels
df$cluster <- as.factor(kmeans_result$cluster)


# Step 4: Analyze Clusters

cluster_summary <- df %>%
  group_by(cluster) %>%
  summarise(
    avg_ev_penetration = mean(ev_penetration),
    avg_income = mean(median_household_income),
    avg_education = mean(percent_with_bachelor_s_or_higher),
    avg_chargers = mean(ev_charging_station_count),
    avg_population = mean(population),
    n_zipcodes = n()
  )

print(cluster_summary)
## # A tibble: 4 × 7
##   cluster avg_ev_penetration avg_income avg_education avg_chargers
##   <fct>                <dbl>      <dbl>         <dbl>        <dbl>
## 1 1                   0.0323     86950.          27.3        13.2 
## 2 2                   0.0197     72359.          22.7         3.55
## 3 3                   0.0823    141842.          54.5         1.32
## 4 4                   0.0905    143379.          60.3        21.7 
## # ℹ 2 more variables: avg_population <dbl>, n_zipcodes <int>
# Step 5: Visualize Clusters

ggplot(df, aes(x = norm_income, y = ev_penetration, color = cluster)) +
  geom_point(alpha = 0.7) +
  labs(
    title = "EV ZIP Code Clusters by Income & Penetration",
    x = "Normalized Median Income",
    y = "EV Penetration Rate"
  ) +
  theme_minimal()

# Load ZIP coordinate dataset (from uszips.csv)
zip_coords <- read_csv("uszips.csv") %>%
  select(zip, lat, lng) %>%
  mutate(zip = as.character(zip))
## Rows: 33783 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): zip, city, state_id, state_name, county_fips, county_name, county_...
## dbl  (4): lat, lng, population, density
## lgl  (4): zcta, parent_zcta, imprecise, military
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Ensure ZIP column in df matches type
df$ZIP <- as.character(df$zip)

# Merge coordinate data into clustered dataset
map_data <- left_join(df, zip_coords, by = c("ZIP" = "zip"))

# Define custom colors for clusters
cluster_palette <- colorFactor(
  palette = c("red", "green", "blue", "purple"),
  domain = map_data$cluster
)

# Create interactive cluster map
leaflet(map_data) %>%
  addTiles() %>%
  addCircleMarkers(
    lng = ~lng,
    lat = ~lat,
    color = ~cluster_palette(cluster),
    label = ~paste0("ZIP: ", ZIP, "<br>Cluster: ", cluster),
    popup = ~paste0(
      "<strong>ZIP:</strong> ", ZIP, "<br>",
      "<strong>Cluster:</strong> ", cluster, "<br>",
      "<strong>Income:</strong> $", round(median_household_income), "<br>",
      "<strong>EV Penetration:</strong> ", round(ev_penetration, 3)
    ),
    radius = 6,
    fillOpacity = 0.7
  ) %>%
  addLegend("bottomright", pal = cluster_palette, values = ~cluster,
            title = "Cluster")

This clustering analysis groups ZIP codes by EV adoption, income, education, and charger access.

Cluster 3 (blue) includes affluent areas with high EV adoption and charger density.

Cluster 2 (green) represents low-income areas with minimal EV uptake and infrastructure.

Cluster 1 (red) includes wealthy ZIPs with low charger access and only moderate EV penetration—highlighting a missed opportunity.

Cluster 4 (purple) consists of moderately educated, mid-income communities with large populations and EV potential, but lower current adoption—making them strong candidates for strategic investment.

Prioritization

pdf <- ev_data %>%
  mutate(
    evs_per_charger = total_ev_count / (ev_charging_station_count + 1),
    ev_penetration = total_ev_count / population,
    norm_income = scales::rescale(median_household_income, to = c(0, 1)),
    norm_education = scales::rescale(percent_with_bachelor_s_or_higher, to = c(0, 1))
  )

#Identify High-Potential ZIP Codes

pop_median <- median(pdf$population)
charger_median <- median(pdf$ev_charging_station_count)
ev_median <- median(pdf$total_ev_count)
income_40 <- quantile(pdf$norm_income, 0.4)
edu_40 <- quantile(df$norm_education, 0.4)

high_potential <- pdf %>%
  filter(
    population > pop_median,
    ev_charging_station_count < charger_median,
    total_ev_count > ev_median,
    norm_income > income_40 | norm_education > edu_40
  )


#Compute Prioritization Score
high_potential <- high_potential %>%
  mutate(
    score = (ev_penetration * norm_income * norm_education) / (1 + ev_charging_station_count)
  ) %>%
  arrange(desc(score))

#Load ZIP Coordinates

zip_coords <- readr::read_csv("uszips.csv") %>%
  select(zip, lat, lng) %>%
  mutate(zip = as.character(zip))
## Rows: 33783 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): zip, city, state_id, state_name, county_fips, county_name, county_...
## dbl  (4): lat, lng, population, density
## lgl  (4): zcta, parent_zcta, imprecise, military
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
high_potential$ZIP <- as.character(high_potential$zip)

map_data <- left_join(high_potential, zip_coords, by = c("ZIP" = "zip"))

#Create Leaflet Map 
top_10 <- map_data %>%
  arrange(desc(score)) %>%
  slice_head(n = 10)

leaflet(top_10) %>%
  addTiles() %>%
  addCircleMarkers(
    lng = ~lng,
    lat = ~lat,
    radius = 8,
    color = "red",
    fillColor = "orange",
    fillOpacity = 0.8,
    label = ~paste0("ZIP: ", ZIP, "<br>Score: ", round(score, 4)),
    popup = ~paste0(
      "<strong>ZIP:</strong> ", ZIP, "<br>",
      "<strong>Score:</strong> ", round(score, 4), "<br>",
      "<strong>EVs:</strong> ", total_ev_count, "<br>",
      "<strong>Chargers:</strong> ", ev_charging_station_count
    )
  )

This map displays the top 10 high-potential ZIP codes for EV charger investments based on a prioritization score. These locations have high EV demand, low charger availability, and relatively stronger socioeconomic indicators, making them ideal targets for maximizing EV adoption impact.